Abstract

Structured output prediction is an important machine learning problem both in theory and practice,
and the max-margin Markov network (M$^3$N) is an effective approach. All state-of-the-art
algorithms for optimizing M$^3$N objectives take at least $O(1/\epsilon)$ iterations to find an
$\epsilon$ accurate solution. Recent results in structured optimization suggest that faster rates are
possible by exploiting the structure of the objective function. Towards this end, [14] proposed an
excessive gap reduction technique based on Euclidean projections which converges in
$O(1/\sqrt{\epsilon})$ iterations on strongly convex functions. Unfortunately, when applied to M$^3$Ns,
this approach does not admit graphical model factorization which, as in many existing algorithms, is
crucial for keeping the cost per iteration tractable. In this paper, we present a new excessive gap
reduction technique based on Bregman projections which admits graphical model factorization naturally,
and converges in $O(1/\sqrt{\epsilon})$ iterations. Compared with existing algorithms, the convergence
rate of our method has better dependence on $\epsilon$ and other parameters of the problem, and can be
easily kernelized.

1 Introduction

In the supervised learning setting, one is given a training set of labeled data points and the aim is to learn 
a function which predicts labels on unseen data points. Sometimes the label space has a rich internal struc- 
ture which characterizes the combinatorial or recursive inter-dependencies of the application domain. It is 
widely believed that capturing these dependencies is critical for effectively learning with structured output. 
Examples of such problems include sequence labeling, context free grammar parsing, and word alignment. 
However, parameter estimation is generally hard even for simple linear models, because the size of the label 
space is potentially exponentially large (see e.g. [3]). Therefore it is crucial to exploit the underlying
conditional independence assumptions for the sake of computational tractability. This is often done by
defining a graphical model on the output space, and exploiting the underlying graphical model
factorization to perform computations.

Research in structured prediction can broadly be categorized into two tracks: using a maximum a
posteriori estimate from the exponential family results in conditional random fields [CRFs, 10], and a
maximum margin approach leads to max-margin Markov networks [M$^3$Ns, 18]. Unsurprisingly, these two
approaches share many commonalities. First, they both minimize a regularized risk with a square norm
regularizer. Second, they assume that there is a joint feature map $\phi$ which maps $(x, y)$ to a
feature vector in $\mathbb{R}^p$.$^1$ Third, they assume a label loss $\ell(y, y^i; x^i)$ which
quantifies the loss of predicting label $y$ when the correct label of input $x^i$ is $y^i$. Finally, they
assume that the space of labels $\mathcal{Y}$ is endowed with a graphical model structure and that
$\phi(x, y)$ and $\ell(y, y^i; x^i)$ factorize according to the cliques of this graphical model. The main
difference is in the loss function employed. CRFs minimize the $L_2$-regularized logistic loss:

$$J(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}} \exp\left(\ell(y, y^i; x^i) - \langle w, \phi(x^i, y^i) - \phi(x^i, y)\rangle\right), \tag{1}$$

while the M$^3$Ns minimize the $L_2$-regularized hinge loss

$$J(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left\{\ell(y, y^i; x^i) - \langle w, \phi(x^i, y^i) - \phi(x^i, y)\rangle\right\}. \tag{2}$$
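To make the difference between the two losses concrete, the following minimal Python sketch evaluates the per-example term of (1) and of (2) on one toy example. The score and loss values are made up for illustration, and the helper names `crf_loss` and `m3n_loss` are hypothetical. The only point it illustrates is that the logistic term is a log-sum-exp, and hence lies between the hinge term and the hinge term plus $\log|\mathcal{Y}|$.

```python
import math

def crf_loss(scores, losses, true_idx):
    """Per-example term of (1): log sum_y exp(l_y - <w, phi(x, y_true) - phi(x, y)>)."""
    terms = [l - (scores[true_idx] - s) for s, l in zip(scores, losses)]
    m = max(terms)  # shift by the maximum to stabilise the log-sum-exp
    return m + math.log(sum(math.exp(t - m) for t in terms))

def m3n_loss(scores, losses, true_idx):
    """Per-example term of (2): max_y { l_y - <w, phi(x, y_true) - phi(x, y)> }."""
    return max(l - (scores[true_idx] - s) for s, l in zip(scores, losses))

# Hypothetical values of <w, phi(x, y)> for three labels; the true label is index 0.
scores = [2.0, 0.5, 1.5]
losses = [0.0, 1.0, 1.0]  # 0/1 label loss
hinge = m3n_loss(scores, losses, 0)
soft = crf_loss(scores, losses, 0)
print(hinge, soft)
```

Both functions consume the same margin terms; only the aggregation over labels (maximum versus log-sum-exp) differs.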



1 We discuss kernels and associated feature maps into a Reproducing Kernel Hilbert Space (RKHS) in the appendix. 




(a) Primal gap, dual gap, and duality gap (b) BMRM gap (and similarly for SVM-Struct)

Figure 1: Illustration of the stopping criteria monitored by various algorithms; convergence rates are
stated with respect to these stopping criteria. $D(\alpha)$ is the Lagrange dual of $J(w)$, and
$\min_w J(w) = \max_\alpha D(\alpha)$. Neither the primal gap nor the dual gap is actually measurable in
practice since $\min_w J(w)$ (and $\max_\alpha D(\alpha)$) is unknown. BMRM (right) therefore uses a
measurable upper bound of the primal gap. SVM-Struct monitors constraint violation, which can also be
translated to an upper bound on the primal gap.
Table 1: Comparison of specialized optimization algorithms for training structured prediction models.
Primal-dual methods maintain estimation sequences in both primal and dual spaces. Details of the oracle
will be discussed in Section 5. The convergence rate highlights the dependence on both $\epsilon$ and some
"constants" that are often hidden in the $O$ notation: $n$, $\lambda$, and the size of the label space
$|\mathcal{Y}|$. No formal convergence rate is known for SMO on M$^3$N, therefore we quote the best known
rate for training binary SVMs due to [12]. The term $G$ in the convergence rate of BMRM and SVM-Struct
denotes the maximum $L_2$ norm of the feature vectors $\phi(x^i, y)$. The convergence rate of
Extragradient depends on $\lambda$ indirectly.



A large body of literature exists on efficient algorithms for minimizing the above objective functions. A
summary of existing methods and their convergence rates (iterations needed to find an $\epsilon$ accurate
solution) can be found in Table 1. The $\epsilon$ accuracy of a solution can be measured in many different
ways. As Figure 1 depicts, different algorithms employ different but somewhat related stopping criteria.
This must be borne in mind when interpreting the convergence rates in Table 1.

Since (1) is a smooth convex objective, classical methods such as L-BFGS can directly be applied [16].
Specialized solvers also exist. For instance, a primal algorithm based on bundle methods was proposed
by [22], while a dual algorithm for the same problem was proposed by [7]. Both algorithms converge at
$O(\frac{1}{\lambda}\log(1/\epsilon))$ rates to an $\epsilon$ accurate solution, and, remarkably, their
convergence rates are independent of $n$, the number of data points, and $|\mathcal{Y}|$, the size of the
label space. It is widely believed in optimization (see e.g. Section 9.3 of [6]) that unconstrained smooth
strongly convex objective functions can be minimized in $O(\log(1/\epsilon))$ iterations, and these
specialized optimizers also achieve this rate.

On the other hand, since (2) is a non-smooth convex function, efficient algorithms are harder to come
by. SVM-Struct was one of the first specialized algorithms to tackle this problem, and [23] derived an
$O(G^2/\lambda\epsilon^2)$ rate of convergence. Here $G$ denotes the maximum $L_2$ norm of the feature
vectors $\phi(x^i, y)$. By refining their analysis, [22] proved an $O(G^2/\lambda\epsilon)$ rate of
convergence for a related but more general algorithm, which they called bundle methods for regularized
risk minimization (BMRM). At first glance, it looks like the rates of convergence of these algorithms are
independent of $|\mathcal{Y}|$. This is somewhat misleading because, although the dependence is not
direct, the convergence rates depend on $G$, which is in turn implicitly related to the size of
$\mathcal{Y}$.

Optimization algorithms which solve (2) in the dual have also been developed. For instance, the algorithm
proposed by [7] performs exponentiated gradient descent in the dual and converges at
$O\left(\frac{\log|\mathcal{Y}|}{\lambda\epsilon}\right)$ rates.

Again, these rates of convergence are not surprising given the well established lower bounds of [13], who
show that, in general, non-smooth optimization problems cannot be solved in fewer than $\Omega(1/\epsilon)$
iterations by solvers which treat the objective function as a black box.

In this paper, we present an algorithm that provably converges to an $\epsilon$ accurate solution of (2) in
$O(1/\sqrt{\epsilon})$ iterations. This does not contradict the lower bound because our algorithm is not a
general purpose black box optimizer. In fact, it exploits the special form of the objective function (2).
Before launching into the technical details we would like to highlight some important features of our
algorithm. First, compared to existing algorithms our convergence rates are better in terms of
$|\mathcal{Y}|$, $\lambda$, and $\epsilon$. Second, our convergence analysis is tighter in that our rates
are with respect to the duality gap. Not only is the duality gap computable, it also upper bounds the
primal and dual gaps used by other algorithms (see Figure 1). Finally, our cost per iteration is comparable
with other algorithms.

To derive our algorithm we extend the recent excessive gap technique of [14] to Bregman projections
and establish rates of convergence (Section 2). This extension is important because the original gradient
based algorithm for strongly convex objectives by [14] does not admit graphical model factorizations,
which are crucial for efficiency in structured prediction problems. We apply our resulting algorithm to the
M$^3$N objective in Section 3. A straightforward implementation requires $O(|\mathcal{Y}|)$ computation per
iteration, which makes it prohibitively expensive. We show that by exploiting the graphical model structure
of $\mathcal{Y}$ the cost per iteration can be reduced to $O(\log|\mathcal{Y}|)$ (Section 4). Finally we
contrast our algorithm with existing techniques in Section 5. The appendix contains some technical proofs
and details on how to handle kernels.

2 Excessive Gap Technique with Bregman Projection 

The following three concepts from convex analysis are extensively used in the sequel. Define
$\overline{\mathbb{R}} := \mathbb{R} \cup \{\infty\}$.

Definition 1 A convex function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is strongly convex with respect
to a norm $\|\cdot\|$ if there exists a constant $\rho > 0$ such that $f - \frac{\rho}{2}\|\cdot\|^2$ is
convex. $\rho$ is called the modulus of strong convexity of $f$, and for brevity we will call $f$
$\rho$-strongly convex.

Definition 2 Suppose a function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is differentiable on
$Q \subseteq \mathbb{R}^n$. Then $f$ is said to have Lipschitz continuous gradient (l.c.g.) with respect to
a norm $\|\cdot\|$ if there exists a constant $L$ such that

$$\|\nabla f(w) - \nabla f(w')\| \le L\|w - w'\| \quad \forall\, w, w' \in Q. \tag{3}$$

For brevity, we will call $f$ $L$-l.c.g.

Definition 3 The Fenchel dual of a function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is a function
$f^* : \mathbb{R}^n \to \overline{\mathbb{R}}$ defined by

$$f^*(w^*) = \sup_{w \in \mathbb{R}^n} \left\{\langle w, w^*\rangle - f(w)\right\}. \tag{4}$$

Strong convexity and l.c.g. are related by Fenchel duality according to the following lemma:

Lemma 4 ([8, Theorems 4.2.1 and 4.2.2])

1. If $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is $\rho$-strongly convex, then $f^*$ is finite on $\mathbb{R}^n$ and $f^*$ is $\frac{1}{\rho}$-l.c.g.

2. If $f : \mathbb{R}^n \to \mathbb{R}$ is convex, differentiable on $\mathbb{R}^n$, and $L$-l.c.g., then $f^*$ is $\frac{1}{L}$-strongly convex.
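As a small numerical illustration of Definition 3 and Lemma 4, the sketch below approximates the Fenchel dual of the one-dimensional quadratic $f(w) = \frac{\rho}{2}w^2$ by a grid search and compares it with the closed form $f^*(v) = \frac{v^2}{2\rho}$, whose derivative $v/\rho$ is $\frac{1}{\rho}$-Lipschitz. The grid, the value of $\rho$, and the helper name `fenchel_dual_on_grid` are illustrative assumptions, not part of the paper.

```python
def fenchel_dual_on_grid(f, v, grid):
    """Approximate f*(v) = sup_w { w * v - f(w) } by maximising over a finite grid."""
    return max(w * v - f(w) for w in grid)

rho = 2.0                                   # assumed modulus of strong convexity
f = lambda w: 0.5 * rho * w * w             # f is rho-strongly convex
grid = [i / 1000.0 for i in range(-5000, 5001)]

for v in (-3.0, 0.0, 1.0, 4.0):
    approx = fenchel_dual_on_grid(f, v, grid)
    exact = v * v / (2.0 * rho)             # closed-form dual of the quadratic
    print(v, approx, exact)
```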

Let $Q_1$ and $Q_2$ be subsets of Euclidean spaces and $A$ be a linear map from $Q_1$ to $Q_2$. Suppose $f$
and $g$ are convex functions defined on $Q_1$ and $Q_2$ respectively. We are interested in the following
optimization problem:

$$\min_{w \in Q_1} J(w) \quad \text{where} \quad J(w) := f(w) + g^*(Aw) = f(w) + \max_{\alpha \in Q_2} \left\{\langle Aw, \alpha\rangle - g(\alpha)\right\}. \tag{5}$$



Figure 2: Illustration of excessive gap. When $\mu_k$ decreases to 0, the "overlap" of $J_{\mu_k}(w)$ and
$D(\alpha)$ becomes narrower and narrower, and both $J_{\mu_k}(w_k)$ and $D(\alpha_k)$ need to lie in this
"narrow tube".



We will make the following standard assumptions: a) $Q_2$ is compact; b) with respect to a certain norm on
$Q_1$, the function $f$ defined on $Q_1$ is $\rho$-strongly convex but not necessarily l.c.g.; and c) with
respect to a certain norm on $Q_2$, the function $g$ defined on $Q_2$ is $L_g$-l.c.g. and convex, but not
necessarily strongly convex. If we identify $f(w)$ with the regularizer and $g^*(Aw)$ with the loss
function, then it is clear that (5) has the same form as (1) and (2). We will exploit this observation in
Section 3.

The key difficulty in solving (5) arises because $g^*$ and hence $J$ may potentially be non-smooth. Our aim
is to uniformly approximate $J(w)$ with a smooth and strongly convex function. Towards this end let $d$ be
a $\sigma$-strongly convex smooth function with the following properties:

$$\min_{\alpha \in Q_2} d(\alpha) = 0, \quad \alpha_0 = \operatorname*{argmin}_{\alpha \in Q_2} d(\alpha), \quad \text{and} \quad D := \max_{\alpha \in Q_2} d(\alpha).$$

In optimization parlance, $d$ is called a prox-function. Let $\mu \in \mathbb{R}$ be an arbitrary positive
constant, and

$$(g + \mu d)^*(w) = \sup_{\alpha \in Q_2} \left\{\langle \alpha, w\rangle - g(\alpha) - \mu\, d(\alpha)\right\}. \tag{6}$$

If $D < \infty$ then it is easy to see that $(g + \mu d)^*$ is uniformly close to $g^*$:

$$g^*(w) - \mu D \le (g + \mu d)^*(w) \le g^*(w). \tag{7}$$

We will use $(g + \mu d)^*$ to define a new objective function

$$J_\mu(w) := f(w) + (g + \mu d)^*(Aw) = f(w) + \max_{\alpha \in Q_2} \left\{\langle Aw, \alpha\rangle - g(\alpha) - \mu\, d(\alpha)\right\}. \tag{8}$$

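If some mild constraint qualifications hold [e.g., Theorem 3.3.5 of 5], one can write the dual $D(\alpha)$
of $J(w)$ using $A^\top$ (the transpose of $A$) as

$$D(\alpha) := -g(\alpha) - f^*(-A^\top \alpha) = -g(\alpha) - \max_{w \in Q_1} \left\{\langle -Aw, \alpha\rangle - f(w)\right\}, \tag{9}$$

and assert the following

$$\inf_{w \in Q_1} J(w) = \sup_{\alpha \in Q_2} D(\alpha), \quad \text{and} \quad J(w) \ge D(\alpha) \quad \forall\, w \in Q_1,\ \alpha \in Q_2. \tag{10}$$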

The key idea of excessive gap minimization pioneered by [14] is to maintain two estimation sequences
$\{w_k\}$ and $\{\alpha_k\}$, together with a diminishing sequence $\{\mu_k\}$ such that

$$J_{\mu_k}(w_k) \le D(\alpha_k), \quad \text{and} \quad \lim_{k \to \infty} \mu_k = 0. \tag{11}$$



The idea is illustrated in Figure 2. In conjunction with (10) and (7), it is not hard to see that
$\{w_k\}$ and $\{\alpha_k\}$ approach the solution of $\min_w J(w) = \max_\alpha D(\alpha)$. Using (5), (7),
(8), and (11), we can derive the rate of convergence of this algorithm:

$$J(w_k) - D(\alpha_k) \le J_{\mu_k}(w_k) + \mu_k D - D(\alpha_k) \le \mu_k D. \tag{12}$$

In other words, the duality gap is reduced at the same rate at which $\mu_k$ approaches 0. All that remains
to turn this idea into an implementable algorithm is to answer the following two questions:

1. How to efficiently find initial points $w_1$, $\alpha_1$ and $\mu_1$ that satisfy (11).

2. Given $w_k$, $\alpha_k$, and $\mu_k$, how to efficiently find $w_{k+1}$, $\alpha_{k+1}$, and $\mu_{k+1}$ which maintain (11).



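Algorithm 1: Excessive gap minimization

Input: Function $f$ which is strongly convex, convex function $g$ which is l.c.g.
Output: Sequences $\{w_k\}$, $\{\alpha_k\}$, and $\{\mu_k\}$ that satisfy (11), with $\lim_{k \to \infty} \mu_k = 0$.

1 Initialize: Let $\alpha_0$ = minimizer of $d$ over $Q_2$, $\mu_1 = \frac{L}{\sigma}$, $w_1 = w(\alpha_0)$, $\alpha_1 = V\left(\alpha_0, -\frac{1}{\mu_1}\nabla D(\alpha_0)\right)$.
2 for $k = 1, 2, \ldots$ do
3     $\tau_k \leftarrow \frac{2}{k+3}$.
4     $\hat{\alpha} \leftarrow (1 - \tau_k)\alpha_k + \tau_k\, \alpha_{\mu_k}(w_k)$.
5     $w_{k+1} \leftarrow (1 - \tau_k) w_k + \tau_k\, w(\hat{\alpha})$.
6     $\tilde{\alpha} \leftarrow V\left(\alpha_{\mu_k}(w_k), -\frac{\tau_k}{(1 - \tau_k)\mu_k}\nabla D(\hat{\alpha})\right)$.
7     $\alpha_{k+1} \leftarrow (1 - \tau_k)\alpha_k + \tau_k \tilde{\alpha}$.
8     $\mu_{k+1} \leftarrow (1 - \tau_k)\mu_k$.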



To achieve the best possible convergence rate it is desirable to anneal $\mu_k$ as fast as possible while
still allowing $w_k$ and $\alpha_k$ to be updated efficiently. [14] gave a solution based on Euclidean
projections, where $\mu_k$ decays at a $1/k^2$ rate and all updates can be computed in closed form. We now
extend his ideas to updates based on Bregman projections$^2$, which will be the key to our application to
structured prediction problems later. Since $d$ is differentiable, we can define a Bregman divergence based
on it:

$$\Delta(\hat{\alpha}, \alpha) := d(\hat{\alpha}) - d(\alpha) - \langle \nabla d(\alpha), \hat{\alpha} - \alpha\rangle. \tag{13}$$

Given a point $\alpha$ and a direction $g$, we can define the Bregman projection as:

$$V(\alpha, g) := \operatorname*{argmin}_{\hat{\alpha} \in Q_2} \left\{\Delta(\hat{\alpha}, \alpha) + \langle g, \hat{\alpha} - \alpha\rangle\right\} = \operatorname*{argmin}_{\hat{\alpha} \in Q_2} \left\{d(\hat{\alpha}) - \langle \nabla d(\alpha) - g, \hat{\alpha}\rangle\right\}.$$
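For intuition, the sketch below instantiates the Bregman projection for one particular prox-function: the negative entropy on the probability simplex, which is the kind of prox-function used for M$^3$Ns in Section 3. This choice, the test values, and the helper names are assumptions made for illustration; the definition above is more general. With this choice $\Delta$ is the KL divergence and $V(\alpha, g)$ reduces to a multiplicative update, $V(\alpha, g)_j \propto \alpha_j \exp(-g_j)$. The code then compares the objective at the closed-form solution with the objective at randomly drawn points of the simplex.

```python
import math
import random

def bregman_projection(alpha, g):
    """V(alpha, g) for the entropy prox-function on the simplex: alpha_j * exp(-g_j), normalised."""
    shift = min(g)  # subtracting a constant from g cancels after normalisation
    unnorm = [a * math.exp(-(gj - shift)) for a, gj in zip(alpha, g)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def objective(alpha_hat, alpha, g):
    """Delta(alpha_hat, alpha) + <g, alpha_hat - alpha>, where Delta is the KL divergence."""
    kl = sum(ah * math.log(ah / a) for ah, a in zip(alpha_hat, alpha) if ah > 0)
    return kl + sum(gj * (ah - a) for gj, ah, a in zip(g, alpha_hat, alpha))

random.seed(0)
alpha = [0.2, 0.3, 0.5]   # hypothetical current point on the simplex
g = [1.0, -0.5, 0.2]      # hypothetical direction
v = bregman_projection(alpha, g)

# Random competitors on the simplex; none should achieve a lower objective than v.
others = []
for _ in range(200):
    r = [random.random() + 1e-9 for _ in alpha]
    s = sum(r)
    others.append([x / s for x in r])
print(v, objective(v, alpha, g), min(objective(p, alpha, g) for p in others))
```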

Since $f$ is assumed to be $\rho$-strongly convex, it follows from Lemma 4 that $-D(\alpha)$ is l.c.g. If we
denote its l.c.g. modulus as $L$, then an easy calculation [e.g., Eq. (7.2) of 14] shows that

$$L = \frac{1}{\rho}\|A\|_{1,2}^2 + L_g, \quad \text{where} \quad \|A\|_{1,2} := \max_{\|w\| = \|\alpha\| = 1} \langle Aw, \alpha\rangle. \tag{14}$$

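For notational convenience, we define the following two maps:

$$w(\alpha) := \operatorname*{argmax}_{w \in Q_1} \left\{\langle -Aw, \alpha\rangle - f(w)\right\} = \nabla f^*(-A^\top \alpha) \tag{15a}$$

$$\alpha_\mu(w) := \operatorname*{argmax}_{\alpha \in Q_2} \left\{\langle Aw, \alpha\rangle - g(\alpha) - \mu\, d(\alpha)\right\} = \nabla (g + \mu d)^*(Aw). \tag{15b}$$

Since both $f$ and $(g + \mu d)$ are strongly convex, the above maps are unique and well defined. With this
notation in place we now describe our excessive gap minimization method in Algorithm 1. Unrolling the
recursive update for $\mu_{k+1}$ yields

$$\mu_{k+1} = (1 - \tau_k)\mu_k = \frac{k+1}{k+3}\mu_k = \frac{(k+1)(k)\cdots 2}{(k+3)(k+2)\cdots 4}\,\mu_1 = \frac{6}{(k+3)(k+2)}\,\frac{L}{\sigma}. \tag{16}$$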

Plugging this into (12) and using (14) immediately yields an $O(1/\sqrt{\epsilon})$ rate of convergence of
our algorithm:

Theorem 5 (Rate of convergence for duality gap) The sequences $\{w_k\}$ and $\{\alpha_k\}$ in Algorithm 1 satisfy

$$J(w_k) - D(\alpha_k) \le \mu_k D = \frac{6LD}{\sigma(k+1)(k+2)} = \frac{6D}{\sigma(k+1)(k+2)}\left(\frac{\|A\|_{1,2}^2}{\rho} + L_g\right). \tag{17}$$

All that remains is to show that

Theorem 6 The update rule of Algorithm 1 guarantees that (11) is satisfied for all $k \ge 1$.

Proof: See Appendix A. ■
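The closed form for $\mu_k$ implied by (16) can be checked numerically. The sketch below assumes the step size $\tau_k = 2/(k+3)$ (the value consistent with the factor $\frac{k+1}{k+3}$ in (16)) and $\mu_1 = L/\sigma$; the function names and the sample constants are illustrative only.

```python
def mu_sequence(L, sigma, num_steps):
    """Return [mu_1, ..., mu_num_steps] using mu_{k+1} = (1 - tau_k) * mu_k with tau_k = 2 / (k + 3)."""
    mu = [L / sigma]                          # mu_1
    for k in range(1, num_steps):
        tau_k = 2.0 / (k + 3)
        mu.append((1.0 - tau_k) * mu[-1])     # mu_{k+1}
    return mu

def mu_closed_form(L, sigma, k):
    """mu_k = 6 L / (sigma (k + 1)(k + 2)), i.e. (16) with the index shifted by one."""
    return 6.0 * L / (sigma * (k + 1) * (k + 2))

mus = mu_sequence(3.0, 1.0, 50)
print(mus[0], mus[9], mu_closed_form(3.0, 1.0, 10))
```

Since the duality gap is bounded by $\mu_k D$, this $1/k^2$ decay is what turns into the $O(1/\sqrt{\epsilon})$ iteration count.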

When stated in terms of the dual gap (as opposed to the duality gap) our convergence results can be 
strengthened slightly. 



2 [14] did discuss updates based on Bregman projections, but just for the case where $f$ is convex rather than strongly
convex. Here, we show how to improve the convergence rate from $O(1/\epsilon)$ to $O(1/\sqrt{\epsilon})$ when $f$ is strongly convex.



Corollary 7 (Rate of convergence for dual gap) The sequence $\{\alpha_k\}$ in Algorithm 1 satisfies

$$\max_{\alpha \in Q_2} D(\alpha) - D(\alpha_k) \le \frac{6L\, d(\alpha^*)}{\sigma(k+1)(k+2)} = \frac{6\, d(\alpha^*)}{\sigma(k+1)(k+2)}\left(\frac{\|A\|_{1,2}^2}{\rho} + L_g\right), \tag{18}$$

where $\alpha^* := \operatorname*{argmax}_{\alpha \in Q_2} D(\alpha)$. Note $d(\alpha^*)$ is tighter than the $D$ in (17).

Proof: See Appendix B. ■

3 Training Max-Margin Markov Networks 

In the max-margin Markov network (M$^3$N) setting [18], we are given $n$ labeled data points
$\{(x^i, y^i)\}_{i=1}^n$, where $x^i$ are drawn from some space $\mathcal{X}$ and $y^i$ belong to some space
$\mathcal{Y}$. We assume that there is a feature map $\phi$ which maps $(x, y)$ to a feature vector in
$\mathbb{R}^p$. Furthermore, for each $x^i$, there is a label loss $\ell^i_y := \ell(y, y^i; x^i)$ which
quantifies the loss of predicting label $y$ when the correct label is $y^i$. Given this setup, the
objective function minimized by M$^3$Ns can be written as

$$J(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left\{\ell^i_y - \langle w, \psi^i_y\rangle\right\}, \tag{19}$$

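where we used the shorthand $\psi^i_y := \phi(x^i, y^i) - \phi(x^i, y)$. To write (19) in the form of (5), we
define $Q_1 = \mathbb{R}^p$, $A$ to be a $(n|\mathcal{Y}|)$-by-$p$ matrix whose $(i, y)$-th row is
$(-\psi^i_y)^\top$,

$$f(w) = \frac{\lambda}{2}\|w\|_2^2, \quad \text{and} \quad g^*(u) = \frac{1}{n}\sum_i \max_y \left\{\ell^i_y + u^i_y\right\}.$$

Now, $g$ can be verified to be:

$$g(\alpha) = \begin{cases} -\sum_i \sum_y \ell^i_y \alpha^i_y & \text{if } \alpha^i_y \ge 0 \text{ and } \sum_y \alpha^i_y = \frac{1}{n}\ \forall i, \\ +\infty & \text{otherwise.} \end{cases} \tag{20}$$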

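The domain of $g$ is $Q_2 = S^n := \left\{\alpha \in [0, 1]^{n|\mathcal{Y}|} : \sum_y \alpha^i_y = \frac{1}{n}\ \forall i\right\}$,
which is convex and compact. Using the $L_2$ norm on $Q_1$ (i.e., $\|w\| = (\sum_i w_i^2)^{1/2}$), $f$ is
clearly $\lambda$-strongly convex. Similarly, if we use the $L_1$ norm on $Q_2$ (i.e.,
$\|\alpha\| = \sum_i \sum_y |\alpha^i_y|$), then $g$ is 0-l.c.g. By noting that
$f^*(-A^\top \alpha) = \frac{1}{2\lambda}\alpha^\top A A^\top \alpha$, one can write the dual form
$D(\alpha) : S^n \to \mathbb{R}$ of $J(w)$ as

$$D(\alpha) = -g(\alpha) - f^*(-A^\top \alpha) = -\frac{1}{2\lambda}\alpha^\top A A^\top \alpha + \sum_i \sum_y \ell^i_y \alpha^i_y, \quad \alpha \in S^n. \tag{21}$$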

3.1 Rates of Convergence 

A natural prox-function to use in our setting is the relative entropy with respect to the uniform
distribution, which is defined as:

$$d(\alpha) = \sum_{i=1}^{n} \sum_y \alpha^i_y \log \alpha^i_y + \log n + \log|\mathcal{Y}|. \tag{22}$$

The relative entropy is 1-strongly convex in $S^n$ with respect to the $L_1$ norm [e.g., 4, Proposition 5.1].
Furthermore, $d(\alpha) \le D = \log|\mathcal{Y}|$ for $\alpha \in S^n$, and the norm of $A$ can be
computed via

$$\|A\|_{1,2} = \max_{w \in \mathbb{R}^p,\, u \in \mathbb{R}^{n|\mathcal{Y}|}} \left\{\langle Aw, u\rangle : \sum_{i=1}^{p} w_i^2 = 1,\ \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} |u^i_y| = 1\right\} = \max_{i,y} \|\psi^i_y\|_2,$$

where $\|\psi^i_y\|_2$ is the Euclidean norm of $\psi^i_y$. Since $f$ is $\lambda$-strongly convex and
$L_g = 0$, plugging this expression of $\|A\|_{1,2}$ into (17) and (18), we obtain the following rates of
convergence for our algorithm:

$$J(w_k) - D(\alpha_k) \le \frac{6 \log|\mathcal{Y}|}{(k+1)(k+2)}\,\frac{\max_{i,y}\|\psi^i_y\|_2^2}{\lambda} \quad \text{and} \quad \max_{\alpha \in Q_2} D(\alpha) - D(\alpha_k) \le \frac{6\,\mathrm{KL}(\alpha^* \| \alpha_0)}{(k+1)(k+2)}\,\frac{\max_{i,y}\|\psi^i_y\|_2^2}{\lambda},$$

where $\mathrm{KL}(\alpha^* \| \alpha_0)$ denotes the KL divergence between $\alpha^*$ and the uniform
distribution $\alpha_0$. Recall that for distributions $p$ and $q$ the KL divergence is defined as
$\mathrm{KL}(p \| q) = \sum_i p_i \ln \frac{p_i}{q_i}$.

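Therefore to reduce the duality gap and dual gap below $\epsilon$, it suffices to take the following number
of steps respectively:

$$\text{Duality gap: } \max_{i,y}\|\psi^i_y\|_2 \sqrt{\frac{6 \log|\mathcal{Y}|}{\lambda\epsilon}}, \qquad \text{Dual gap: } \max_{i,y}\|\psi^i_y\|_2 \sqrt{\frac{6\,\mathrm{KL}(\alpha^* \| \alpha_0)}{\lambda\epsilon}}. \tag{23}$$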
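To get a feel for these numbers, the following sketch evaluates the duality-gap rate and the corresponding step count for made-up problem constants. The function names and all sample values (feature norm, $\lambda$, $\epsilon$, label space size) are hypothetical; the formulas are the duality gap bound above and its inversion.

```python
import math

def duality_gap_bound(k, max_psi_norm, lam, num_labels):
    """Upper bound on J(w_k) - D(alpha_k): 6 log|Y| max||psi||^2 / ((k + 1)(k + 2) lambda)."""
    return 6.0 * math.log(num_labels) * max_psi_norm ** 2 / ((k + 1) * (k + 2) * lam)

def steps_duality_gap(max_psi_norm, lam, eps, num_labels):
    """Number of steps that suffices for a duality gap below eps."""
    return math.ceil(max_psi_norm * math.sqrt(6.0 * math.log(num_labels) / (lam * eps)))

# Hypothetical constants: a binary chain of length 20 has |Y| = 2**20 labelings.
k1 = steps_duality_gap(1.0, 0.01, 1e-3, 2 ** 20)
k2 = steps_duality_gap(1.0, 0.01, 1e-5, 2 ** 20)
print(k1, k2, k2 / k1)
```

Shrinking $\epsilon$ by a factor of 100 increases the step count by a factor of about 10, whereas an $O(1/\epsilon)$ method would need about 100 times as many iterations.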



3.2 Computing the Approximation $J_\mu(w)$ and Connection to CRFs

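In this section we show how to compute $J_\mu(w)$. Towards this end, we first compute $(g + \mu d)^*(u)$.

Lemma 8 The Fenchel dual of $(g + \mu d)$ is given by

$$(g + \mu d)^*(u) = \frac{\mu}{n}\sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}} \exp\left(\frac{u^i_y + \ell^i_y}{\mu}\right) - \mu \log|\mathcal{Y}|, \tag{24}$$

and the $(i, y)$-th element of its gradient can be written as

$$\left(\nabla (g + \mu d)^*(u)\right)^i_y = \frac{1}{n}\,\frac{\exp\left(\frac{u^i_y + \ell^i_y}{\mu}\right)}{\sum_{y'} \exp\left(\frac{u^i_{y'} + \ell^i_{y'}}{\mu}\right)}. \tag{25}$$

Proof: See Supplementary Material E. ■

Using the above lemma, plugging in the definition of $A$ and $\psi^i_y$ and assuming that
$\ell^i_{y^i} = 0$, we get

$$J_\mu(w) = f(w) + (g + \mu d)^*(Aw) = \frac{\lambda}{2}\|w\|_2^2 - \frac{\mu}{n}\sum_{i=1}^{n} \log p(y^i | x^i; w) - \mu \log|\mathcal{Y}|, \tag{26}$$

where $p(y | x^i; w) \propto \exp\left(\frac{\ell^i_y + \langle w, \phi(x^i, y)\rangle}{\mu}\right)$.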

This interpretation clearly shows that the approximation $J_\mu(w)$ essentially converts the maximum margin
estimation problem (2) into a CRF estimation problem (1). Here $\mu$ determines the quality of the
approximation; when $\mu \to 0$, $p(y | x^i; w)$ tends to the delta distribution with the probability mass
concentrated on $\operatorname{argmax}_y\, \ell^i_y + \langle w, \phi(x^i, y)\rangle$. Besides, the loss
$\ell^i_y$ rescales the distribution.
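The smoothing in Lemma 8 can be illustrated on a single training example ($n = 1$) with a handful of labels. The sketch below is a hedged illustration with made-up values of $u$ and $\ell$ and hypothetical function names: it evaluates $g^*$, the smoothed dual (24), and the gradient (25), so that the sandwich bound (7) with $D = \log|\mathcal{Y}|$ and the concentration of the gradient on the argmax as $\mu \to 0$ can be inspected.

```python
import math

def g_star(u, loss):
    """g*(u) for one training example (n = 1): max_y { l_y + u_y }."""
    return max(l + x for l, x in zip(loss, u))

def smoothed_g_star(u, loss, mu):
    """(g + mu d)*(u) from (24) with n = 1, using a stabilised log-sum-exp."""
    z = [(l + x) / mu for l, x in zip(loss, u)]
    m = max(z)
    return mu * (m + math.log(sum(math.exp(t - m) for t in z))) - mu * math.log(len(u))

def smoothed_gradient(u, loss, mu):
    """Gradient (25) with n = 1: a softmax over (l_y + u_y) / mu."""
    z = [(l + x) / mu for l, x in zip(loss, u)]
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

u = [0.3, -1.0, 0.8]      # hypothetical values of (Aw) for three labels
loss = [0.0, 1.0, 1.0]    # 0/1 label loss, true label is index 0
for mu in (1.0, 0.1, 0.01):
    print(mu, smoothed_g_star(u, loss, mu), g_star(u, loss), smoothed_gradient(u, loss, mu))
```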

Given the above interpretation, it is tempting to argue that every non-smooth problem can be solved by
computing a smooth approximation $J_\mu(w)$, and applying a standard smooth convex optimizer to minimize
$J_\mu(w)$. Unfortunately, this approach is fraught with problems. In order to get a close enough
approximation of $J(w)$, $\mu$ needs to be set to a very small number, which makes $J_\mu(w)$
ill-conditioned and leads to numerical issues in the optimizer. The excessive gap technique adaptively
changes $\mu$ in each iteration in order to avoid these problems.

4 Efficient Implementation by Exploiting Clique Decomposition 

In the structured large margin setting, the number of labels $|\mathcal{Y}|$ could potentially be
exponentially large. For example, if a sequence has $l$ nodes and each node has two states, then
$|\mathcal{Y}| = 2^l$. A naive implementation of the excessive gap reduction algorithm described in the
previous section requires maintaining and updating $O(|\mathcal{Y}|)$ coefficients at every iteration,
which is prohibitively expensive. With a view to reducing the computational complexity, and also to take
into account the inherent conditional independence properties of the output space, it is customary to
assume that $\mathcal{Y}$ is endowed with a graphical model structure; an in-depth treatment of this issue
can be found in the structured prediction literature. For our purposes it suffices to assume that
$\ell(y, y^i; x^i)$ and $\phi(x^i, y)$ decompose according to the cliques$^3$ of an undirected graphical
model, and hence can be written (with some abuse of notation) as

$$\ell^i_y = \ell(y, y^i; x^i) = \sum_{c \in C} \ell(y_c, y^i_c; x^i) = \sum_{c \in C} \ell^i_{y_c}, \quad \phi(x^i, y) = \bigoplus_{c \in C} \phi_c(x^i, y_c), \quad \text{and} \quad \psi^i_y = \bigoplus_{c \in C} \psi^i_{y_c}. \tag{27}$$

Here $C$ denotes the set of all cliques of the graphical model and $\oplus$ denotes vector concatenation.
More explicitly, $\psi^i_y$ is the vector on the graphical model obtained by accumulating the vector
$\psi^i_{y_c}$ on all the cliques $c$ of the graph.

Let $h_c(y_c)$ be an arbitrary real valued function on the value of $y$ restricted to clique $c$. Graphical
models define a distribution $p(y)$ on $y \in \mathcal{Y}$ whose density takes the following factorized
form:

$$p(y) \propto q(y) = \prod_{c \in C} \exp\left(h_c(y_c)\right). \tag{28}$$

The key advantage of a graphical model is that the marginals on the cliques can be efficiently computed:

$$m_{y_c} := \sum_{z : z|_c = y_c} q(z) = \sum_{z : z|_c = y_c}\ \prod_{c' \in C} \exp\left(h_{c'}(z_{c'})\right),$$



3 Any fully connected subgraph of a graph is called a clique. 



where the summation is over all the configurations $z$ in $\mathcal{Y}$ whose restriction on the clique $c$
equals $y_c$. Although $\mathcal{Y}$ can be exponentially large, efficient dynamic programming algorithms
exist that exploit the factorized form (28), e.g. belief propagation [11]. The computational cost is
$O(s^\omega)$ where $s$ is the number of states of each node, and $\omega$ is the maximum size of the
cliques. For example, a linear chain has $\omega = 2$. When $\omega$ is large, approximate algorithms also
exist [24, 2, 9]. In the sequel we will assume that our graphical models are tractable, i.e., $\omega$ is
low.
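As a concrete instance of such dynamic programming, the sketch below computes the unnormalized edge marginals of a linear chain (cliques are the edges, so $\omega = 2$) with a standard forward-backward recursion and compares them with brute-force enumeration over all labelings. The random log-potentials `h` and the function names are hypothetical; this is a textbook sum-product pass, not code from the paper.

```python
import itertools
import math
import random

def chain_edge_marginals(h):
    """h[t][a][b] is the log-potential of edge (t, t+1). Returns unnormalised edge marginals."""
    num_edges, s = len(h), len(h[0])
    fwd = [[1.0] * s]                         # fwd[t][a]: total weight of prefixes ending in state a
    for t in range(num_edges):
        fwd.append([sum(fwd[t][a] * math.exp(h[t][a][b]) for a in range(s)) for b in range(s)])
    bwd = [[1.0] * s for _ in range(num_edges + 1)]   # bwd[t][a]: total weight of suffixes starting in a
    for t in range(num_edges - 1, -1, -1):
        bwd[t] = [sum(math.exp(h[t][a][b]) * bwd[t + 1][b] for b in range(s)) for a in range(s)]
    return [[[fwd[t][a] * math.exp(h[t][a][b]) * bwd[t + 1][b] for b in range(s)]
             for a in range(s)] for t in range(num_edges)]

def brute_force_edge_marginals(h):
    """Same quantity by summing q(z) over all s**(num_edges + 1) labelings."""
    num_edges, s = len(h), len(h[0])
    m = [[[0.0] * s for _ in range(s)] for _ in range(num_edges)]
    for z in itertools.product(range(s), repeat=num_edges + 1):
        q = math.exp(sum(h[t][z[t]][z[t + 1]] for t in range(num_edges)))
        for t in range(num_edges):
            m[t][z[t]][z[t + 1]] += q
    return m

random.seed(1)
h = [[[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)] for _ in range(4)]
fast = chain_edge_marginals(h)
slow = brute_force_edge_marginals(h)
print(fast[0][0][0], slow[0][0][0])
```

The recursion touches each edge a constant number of times at $O(s^2)$ cost per edge, while the brute-force version enumerates all $s^{l+1}$ labelings.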

4.1 Basics 

At each iteration of Algorithm 1, we need to compute four quantities: $w(\alpha)$, $\nabla D(\alpha)$,
$\alpha_\mu(w)$, and $V(\alpha, g)$. Below we rewrite them by taking into account the factorization (27),
and postpone to Section 4.2 the discussion on how to compute them efficiently. Since $\alpha^i_y \ge 0$ and
$\sum_y \alpha^i_y = \frac{1}{n}$, the $\{\alpha^i_y : y \in \mathcal{Y}\}$ form an unnormalized
distribution, and we denote its (unnormalized) marginal distribution on clique $c$ by

$$\alpha^i_{y_c} := \sum_{z : z|_c = y_c} \alpha^i_z. \tag{29}$$

The feature expectations on the cliques with respect to the unnormalized distributions $\alpha$ are
important:

$$F[\psi^i_c; \alpha] := \sum_{y_c} \alpha^i_{y_c} \psi^i_{y_c}, \quad \text{and} \quad F[\psi_c; \alpha] := \sum_i F[\psi^i_c; \alpha]. \tag{30}$$

Clearly, if for all $i$ the marginals of $\alpha$ on the cliques (i.e., $\{\alpha^i_{y_c} : i, c, y_c\}$ in
(29)) are available, then these two expectations can be computed efficiently.

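• $w(\alpha)$: As a consequence of (27) we can write $\psi^i_y = \bigoplus_{c \in C} \psi^i_{y_c}$. Plugging
this into (15a) and recalling that $\nabla f^*(-A^\top \alpha) = -\frac{1}{\lambda}A^\top \alpha$ yields the
following expression for $w(\alpha) = -\frac{1}{\lambda}A^\top \alpha$:

$$w(\alpha) = \frac{1}{\lambda}\sum_i \sum_y \alpha^i_y \psi^i_y = \frac{1}{\lambda}\sum_i \sum_y \alpha^i_y \left(\bigoplus_{c \in C} \psi^i_{y_c}\right) = \frac{1}{\lambda}\bigoplus_{c \in C}\left(\sum_i F[\psi^i_c; \alpha]\right) = \frac{1}{\lambda}\bigoplus_{c \in C} F[\psi_c; \alpha]. \tag{31}$$

• $\nabla D(\alpha)$: Using (21) and the definition of $w(\alpha)$, the $(i, y)$-th element of
$\nabla D(\alpha)$ can be written as

$$\left(\nabla D(\alpha)\right)^i_y = \ell^i_y - \frac{1}{\lambda}\left(A A^\top \alpha\right)^i_y = \ell^i_y - \langle \psi^i_y, w(\alpha)\rangle = \sum_{c}\left(\ell^i_{y_c} - \frac{1}{\lambda}\langle \psi^i_{y_c}, F[\psi_c; \alpha]\rangle\right). \tag{32}$$

• $\alpha_\mu(w)$: Using (15b) and (25), the $(i, y)$-th element of $\alpha_\mu(w)$ given by
$\left(\nabla (g + \mu d)^*(Aw)\right)^i_y$ can be written as

$$\left(\alpha_\mu(w)\right)^i_y = \frac{1}{n}\,\frac{\exp\left(\mu^{-1}\left(\ell^i_y - \langle \psi^i_y, w\rangle\right)\right)}{\sum_{y'} \exp\left(\mu^{-1}\left(\ell^i_{y'} - \langle \psi^i_{y'}, w\rangle\right)\right)} = \frac{1}{n}\,\frac{\prod_c \exp\left(\mu^{-1}\left(\ell^i_{y_c} - \langle \psi^i_{y_c}, w_c\rangle\right)\right)}{\sum_{y'} \prod_c \exp\left(\mu^{-1}\left(\ell^i_{y'_c} - \langle \psi^i_{y'_c}, w_c\rangle\right)\right)}. \tag{33}$$

• $V(\alpha, g)$: Since the prox-function $d$ is the relative entropy, the $(i, y)$-th element of
$V(\alpha, g)$ is

$$\left(V(\alpha, g)\right)^i_y = \frac{1}{n}\,\frac{\alpha^i_y \exp(-g^i_y)}{\sum_{y'} \alpha^i_{y'} \exp(-g^i_{y'})}. \tag{34}$$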
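The four quantities above are enough to run Algorithm 1 on a problem small enough to enumerate all labels. The following is a minimal sketch under several assumptions of ours: a multiclass toy problem with a single clique (so no graphical model inference is needed), a block feature map built by the hypothetical helper `make_problem`, the step size $\tau_k = 2/(k+3)$, and $\mu_1 = \max_{i,y}\|\psi^i_y\|^2/\lambda$. It is meant only to show how (31) to (34) fit together and to record the duality gap next to the bound $\mu_k \log|\mathcal{Y}|$; it is not the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_problem(xs, ys, num_labels):
    """Toy multiclass data: phi(x, y) places x in the y-th block; the label loss is 0/1."""
    dim = len(xs[0])
    def phi(x, y):
        v = [0.0] * (dim * num_labels)
        v[y * dim:(y + 1) * dim] = x
        return v
    psi = [[[a - b for a, b in zip(phi(x, yi), phi(x, y))] for y in range(num_labels)]
           for x, yi in zip(xs, ys)]
    loss = [[0.0 if y == yi else 1.0 for y in range(num_labels)] for yi in ys]
    return psi, loss

def w_of(alpha, psi, lam):                    # (31): w(alpha) = (1/lambda) sum alpha * psi
    p = len(psi[0][0])
    return [sum(a * v[j] for ai, pi in zip(alpha, psi) for a, v in zip(ai, pi)) / lam
            for j in range(p)]

def grad_D(alpha, psi, loss, lam):            # (32): loss - <psi, w(alpha)>
    w = w_of(alpha, psi, lam)
    return [[l - dot(v, w) for l, v in zip(li, pi)] for li, pi in zip(loss, psi)]

def scaled_softmax(logits, n):
    """Softmax of the logits, scaled so that the entries sum to 1/n."""
    m = max(logits)
    e = [math.exp(t - m) for t in logits]
    z = sum(e)
    return [t / (n * z) for t in e]

def alpha_mu(w, mu, psi, loss):               # (33)
    n = len(psi)
    return [scaled_softmax([(l - dot(v, w)) / mu for l, v in zip(li, pi)], n)
            for li, pi in zip(loss, psi)]

def V(alpha, g):                              # (34): proportional to alpha * exp(-g)
    n = len(alpha)
    return [scaled_softmax([(math.log(a) if a > 0 else float("-inf")) - gj
                            for a, gj in zip(ai, gi)], n)
            for ai, gi in zip(alpha, g)]

def primal(w, psi, loss, lam):                # (19)
    n = len(psi)
    return 0.5 * lam * dot(w, w) + sum(
        max(l - dot(v, w) for l, v in zip(li, pi)) for li, pi in zip(loss, psi)) / n

def dual(alpha, psi, loss, lam):              # (21), using alpha^T A A^T alpha = lambda^2 ||w(alpha)||^2
    w = w_of(alpha, psi, lam)
    return -0.5 * lam * dot(w, w) + sum(
        a * l for ai, li in zip(alpha, loss) for a, l in zip(ai, li))

def mix(tau, a, b):
    """(1 - tau) * a + tau * b for nested lists of equal shape."""
    return [[(1 - tau) * x + tau * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def excessive_gap(psi, loss, lam, num_iters):
    n, K = len(psi), len(psi[0])
    mu = max(dot(v, v) for pi in psi for v in pi) / lam       # mu_1 = L / sigma with sigma = 1
    alpha0 = [[1.0 / (n * K)] * K for _ in range(n)]           # uniform distribution
    w = w_of(alpha0, psi, lam)
    alpha = V(alpha0, [[-x / mu for x in row] for row in grad_D(alpha0, psi, loss, lam)])
    gaps = []                                                  # (duality gap, mu_k * log|Y|) per iteration
    for k in range(1, num_iters + 1):
        gaps.append((primal(w, psi, loss, lam) - dual(alpha, psi, loss, lam), mu * math.log(K)))
        tau = 2.0 / (k + 3)
        a_mu = alpha_mu(w, mu, psi, loss)
        a_hat = mix(tau, alpha, a_mu)
        w_hat = w_of(a_hat, psi, lam)
        eta = tau / ((1 - tau) * mu)
        a_tilde = V(a_mu, [[-eta * x for x in row] for row in grad_D(a_hat, psi, loss, lam)])
        w = [(1 - tau) * x + tau * y for x, y in zip(w, w_hat)]
        alpha = mix(tau, alpha, a_tilde)
        mu = (1 - tau) * mu
    return w, alpha, gaps

psi, loss = make_problem([[1.0, 0.2], [0.1, 1.0], [0.9, 0.8]], [0, 1, 2], 3)
w, alpha, gaps = excessive_gap(psi, loss, 0.5, 60)
print(gaps[0], gaps[-1])
```

Enumerating all labels costs $O(|\mathcal{Y}|)$ per example, which is only viable for such toy problems; replacing the explicit sums by clique marginals is what makes the method practical.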



4.2 Efficient Computation 

We now show how the algorithm can be made efficient by taking into account (27). Key to our efficient
implementation are the following four observations from Algorithm 1 when applied to the structured large
margin setting. In particular, we will exploit the fact that the marginals of $\alpha_k$ can be updated
iteratively.

• The marginals of $\alpha_{\mu_k}(w_k)$ and $\hat{\alpha}$ can be computed efficiently. From (33) it is
easy to see that $\alpha_{\mu_k}(w_k)$ can be written as a product of factors over cliques, that is, in the
form of (28). Therefore, the marginals of $\alpha_{\mu_k}(w_k)$ can be computed efficiently. As a result, if
we keep track of the marginal distributions of $\alpha_k$, then it is trivial to compute the marginals of
$\hat{\alpha} = (1 - \tau_k)\alpha_k + \tau_k\, \alpha_{\mu_k}(w_k)$.

• The marginals of $\tilde{\alpha}$ can be computed efficiently. Define
$\eta = \frac{\tau_k}{(1 - \tau_k)\mu_k}$. Since $V(\alpha, g) \propto \alpha \exp(-g)$ by (34) and here
$g = -\eta \nabla D(\hat{\alpha})$, plugging (32) and (33) into (34) and observing that
$\nabla D(\alpha)$ can be written as a sum of terms over cliques, one obtains:

$$\tilde{\alpha}^i_y = \left(V\left(\alpha_{\mu_k}(w_k), -\eta \nabla D(\hat{\alpha})\right)\right)^i_y \propto \left(\alpha_{\mu_k}(w_k)\right)^i_y \exp\left(\eta \left(\nabla D(\hat{\alpha})\right)^i_y\right)$$

$$\propto \prod_c \exp\left(\mu_k^{-1}\left(\ell^i_{y_c} - \langle \psi^i_{y_c}, (w_k)_c\rangle\right) + \eta\, \ell^i_{y_c} - \eta \lambda^{-1}\langle \psi^i_{y_c}, F[\psi_c; \hat{\alpha}]\rangle\right). \tag{35}$$

Clearly, $\tilde{\alpha}$ factorizes and has the form of (28). Hence its marginals can be computed
efficiently.



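Algorithm 2: Max-margin structured learning using clique factorization

Input: Loss functions $\{\ell^i_y\}$ and features $\{\psi^i_y\}$, a regularization parameter $\lambda$, a tolerance level $\epsilon > 0$.
Output: A pair $w$ and $\alpha$ that satisfy $J(w) - D(\alpha) < \epsilon$.

1 Initialize: $k \leftarrow 1$, $\mu_1 \leftarrow \frac{1}{\lambda}\max_{i,y}\|\psi^i_y\|^2$, $\alpha_0 \leftarrow \left(\frac{1}{n|\mathcal{Y}|}, \ldots, \frac{1}{n|\mathcal{Y}|}\right)^\top \in \mathbb{R}^{n|\mathcal{Y}|}$.
2 Update $w_1 \leftarrow w(\alpha_0) = \frac{1}{\lambda}\bigoplus_{c \in C} F[\psi_c; \alpha_0]$, $\alpha_1 \leftarrow V\left(\alpha_0, -\frac{1}{\mu_1}\nabla D(\alpha_0)\right)$ and compute its marginals.
3 while $J(w_k) - D(\alpha_k) \ge \epsilon$ do /* Termination criterion: duality gap falls below $\epsilon$ */
4     $\tau_k \leftarrow \frac{2}{k+3}$.
5     Compute the marginals of $\alpha_{\mu_k}(w_k)$ by exploiting (33).
6     forall the cliques $c \in C$ do
7         Compute the marginals $\hat{\alpha}_c$ by convex combination: $\hat{\alpha}_c \leftarrow (1 - \tau_k)(\alpha_k)_c + \tau_k(\alpha_{\mu_k}(w_k))_c$.
8         Update the weight on clique $c$: $(w_{k+1})_c \leftarrow (1 - \tau_k)(w_k)_c + \frac{\tau_k}{\lambda}\sum_i F[\psi^i_c; \hat{\alpha}_c]$.
9     Compute the marginals of $\tilde{\alpha}$ by exploiting (35) and using the marginals $\{\hat{\alpha}_c\}$.
10    forall the cliques $c \in C$ do
11        Update the marginals $(\alpha_k)_c$ by convex combination: $(\alpha_{k+1})_c \leftarrow (1 - \tau_k)(\alpha_k)_c + \tau_k \tilde{\alpha}_c$.
12    Update $\mu_{k+1} \leftarrow (1 - \tau_k)\mu_k$, $k \leftarrow k + 1$.
13 return $w_k$ and $\alpha_k$.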



• The marginals of $\alpha_k$ can be updated efficiently. Given the marginals of $\tilde{\alpha}$, it is
trivial to update the marginals of $\alpha_{k+1}$ since
$\alpha_{k+1} = (1 - \tau_k)\alpha_k + \tau_k \tilde{\alpha}$. For convenience, define
$\alpha_c := \{\alpha^i_{y_c} : i, y_c\}$.

• $w_k$ can be updated efficiently. According to step 5 of Algorithm 1, by using (31) we have

$$(w_{k+1})_c = (1 - \tau_k)(w_k)_c + \tau_k (w(\hat{\alpha}))_c = (1 - \tau_k)(w_k)_c + \tau_k \lambda^{-1} F[\psi_c; \hat{\alpha}].$$

Leveraging these observations, Algorithm 2 provides a complete listing of how to implement the excessive
gap technique with Bregman projections for training M$^3$N. It focuses on clarifying the ideas; a practical
implementation can be sped up in many ways. The last issue to be addressed is the computation of the primal
and dual objectives $J(w_k)$ and $D(\alpha_k)$, so as to monitor the duality gap. See Appendix C for details.

4.3 Kernelization 

When nonlinear kernels are used, the feature vectors $\psi^i_y$ are not expressed explicitly and only their
inner products can be evaluated via kernels on the cliques:

$$\langle \psi^i_y, \psi^j_{y'}\rangle := k\left((x^i, y), (x^j, y')\right) = \sum_c k_c\left((x^i, y_c), (x^j, y'_c)\right), \quad \text{where} \quad k_c\left((x^i, y_c), (x^j, y'_c)\right) := \langle \psi^i_{y_c}, \psi^j_{y'_c}\rangle.$$

Algorithm 2 is no longer applicable because no explicit expression of $w$ is available. However, by
rewriting $w_k$ as the feature expectations with respect to some underlying distribution which can be
updated implicitly, all the updates and objective function evaluations can still be done efficiently.
Details are in Appendix D.

4.4 Efficiency in Memory and Computation 

For concreteness, let us consider a sequence as an example. Here the cliques are just edges between
consecutive nodes. Suppose there are $l + 1$ nodes and each node has $s$ states. The memory cost of
Algorithm 2 is $O(nls^2)$, due to the storage of the marginals. The computational cost per iteration is
dominated by calculating the marginals of $\hat{\alpha}$ and $\tilde{\alpha}$, which is $O(nls^2)$ by
standard graphical model inference. The remaining operations in Algorithm 2 cost $O(nls^2)$ for linear
kernels. If nonlinear kernels are used, then the cost becomes $O(n^2 ls^2)$ (see Appendix D).

5 Discussion 

Structured output prediction is an important learning task in both theory and practice. The main
contribution of our paper is twofold. First, we identified an efficient algorithm by [14] for solving the
optimization problems in structured prediction. We proved the $O(1/\sqrt{\epsilon})$ rate of convergence
for the Bregman projection based updates in excessive gap optimization, while [14] showed this rate only
for projected gradient style updates. In M$^3$N optimization, Bregman projection plays a key role in
factorizing the computations, while technically such factorizations are not applicable to projected
gradient. Second, we designed a nontrivial application of the excessive gap technique to M$^3$N
optimization, in which the computations are kept efficient by using the graphical model decomposition.
Kernelized objectives can also be handled by our method, and we proved convergence and computational
guarantees that are superior to those of existing algorithms.

When M$^3$Ns are trained in a batch fashion, we can compare the convergence rate of the dual gap between our
algorithm and the exponentiated gradient method [ExpGrad, 7]. Assume $\alpha_0$, the initial value of
$\alpha$, is the uniform distribution and $\alpha^*$ is the optimal dual solution. Then by (23), we have

$$\text{Ours: } \max_{i,y}\|\psi^i_y\|_2 \sqrt{\frac{6\,\mathrm{KL}(\alpha^* \| \alpha_0)}{\lambda\epsilon}}, \qquad \text{ExpGrad: } \max_{i,y}\|\psi^i_y\|_2^2\, \frac{\mathrm{KL}(\alpha^* \| \alpha_0)}{\lambda\epsilon}.$$

It is clear that our iteration bound is almost the square root of that of ExpGrad, and has much better
dependence on $\epsilon$, $\lambda$, $\max_{i,y}\|\psi^i_y\|$, as well as the divergence from the initial
guess to the optimal solution $\mathrm{KL}(\alpha^* \| \alpha_0)$.

In addition, the cost per iteration of our algorithm is almost the same as ExpGrad, and both are governed
by the computation of the expected feature values on the cliques (which we call exp-oracle), or
equivalently the marginal distributions. For graphical models, exact inference algorithms such as belief
propagation can compute the marginals via dynamic programming [11]. Finally, although both algorithms
require marginalization, the marginals are calculated in very different ways. In ExpGrad, the dual
variables $\alpha$ correspond to a factorized distribution, and in each iteration its potential functions
on the cliques are updated using the exponentiated gradient rule. In contrast, our algorithm explicitly
updates the marginal distributions of $\alpha_k$ on the cliques, and marginalization inference is needed
only for $\hat{\alpha}$ and $\tilde{\alpha}$. Indeed, the joint distribution $\alpha$ does not factorize,
which can be seen from step 7 of Algorithm 1: the convex combination of two factorized distributions is not
necessarily factorized.

Marginalization is just one type of query that can be answered efficiently by graphical models, and another
important query is the max a-posteriori inference (which we call max-oracle): given the current model $w$,
find the argmax in (2). Max-oracle has been used by greedy algorithms such as cutting plane (BMRM and
SVM-Struct) and sequential minimal optimization [SMO, 17, Chapter 6]. SMO picks the steepest descent
coordinate in the dual and greedily optimizes the quadratic analytically, but its convergence rate is slower
than BMRM by a factor of $n$. The max-oracle again relies on graphical models for dynamic programming [9],
and many existing combinatorial optimizers can also be used, such as in the applications of matching [21]
and context free grammar parsing [19]. Furthermore, this oracle is particularly useful for solving the slack
rescaling variant of M$^3$N proposed by [23]:

$$J(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left\{\ell(y, y^i; x^i)\left(1 - \langle w, \phi(x^i, y^i) - \phi(x^i, y)\rangle\right)\right\}. \tag{36}$$

Here two factorized terms get multiplied, which causes additional complexity in finding the maximizer.
[1, Section 1.4.1] solved this problem by a modified dynamic program. Nevertheless, it is not clear how
ExpGrad or our method can be used to optimize this objective.

In the quest for faster optimization algorithms for M³Ns, the following three questions are important:
how hard is it to optimize M³N intrinsically, how informative is the oracle, which is the only way for the
algorithm to access the objective function, and how well does the algorithm make use of such information.
The superiority of our algorithm suggests that the exp-oracle is more informative than the max-oracle, and a
deeper explanation is that the max-oracle is local while the exp-oracle is not [13, Section 1.3]. Hence it
is no surprise that the less informative max-oracle is easier to compute, which makes it applicable to a wider
range of problems such as (36). Moreover, the comparison between ExpGrad and our algorithm shows that
even if the exp-oracle is used, the algorithm still needs to make good use of it in order to converge faster.
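
To make the two oracles concrete, here is a minimal sketch on a toy chain model. All scores and function names are hypothetical. The max-oracle is implemented by the Viterbi recursion and returns only the best score, not the maximizing labeling. For the exp-oracle we compute the log-partition function with the forward recursion, used here as a stand-in: the clique marginals that the exp-oracle actually returns are the gradient of this quantity and follow from the same dynamic program run in both directions. Both results are checked against brute-force enumeration.

```python
import itertools
import math


def chain_score(y, node, edge):
    """Total score of labeling y: node scores plus pairwise edge scores."""
    s = sum(node[t][y[t]] for t in range(len(y)))
    return s + sum(edge[y[t]][y[t + 1]] for t in range(len(y) - 1))


def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def max_oracle(node, edge):
    """Viterbi recursion: the best total score over all labelings."""
    K = len(node[0])
    best = list(node[0])
    for t in range(1, len(node)):
        best = [node[t][b] + max(best[a] + edge[a][b] for a in range(K))
                for b in range(K)]
    return max(best)


def exp_oracle_log_partition(node, edge):
    """Forward recursion: log of the sum of exp(score) over all labelings."""
    K = len(node[0])
    fwd = list(node[0])
    for t in range(1, len(node)):
        fwd = [node[t][b] + logsumexp([fwd[a] + edge[a][b] for a in range(K)])
               for b in range(K)]
    return logsumexp(fwd)


# Hypothetical scores for a chain of length 3 with 2 labels per position.
node = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
edge = [[0.2, -0.1], [-0.3, 0.4]]

labelings = list(itertools.product(range(2), repeat=3))
brute_max = max(chain_score(y, node, edge) for y in labelings)
brute_log_z = logsumexp([chain_score(y, node, edge) for y in labelings])
```

The two recursions differ only in replacing `max` by `logsumexp`, which reflects that both oracles rely on the same graphical model structure even though they return different information.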

For future research, it is interesting to study the lower bound complexity for optimizing M³N, including
the dependence on $\epsilon$, $n$, $\lambda$, $\mathcal{Y}$, and probably even on the graphical model topology. Empirical evaluation of
our algorithm is also desirable, along the lines of sequence labeling, word alignment, context free grammar
parsing, etc.
Appendix (to be considered in the 13 page limit)
A Proof of Theorem 6

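To prove Theorem 6 we begin with a technical lemma.
Lemma 9 (Lemma 7.2 of [14]) For any $\alpha$ and $\hat{\alpha}$, we have

$$D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha - \hat{\alpha} \rangle \ge -g(\alpha) + \langle A w(\hat{\alpha}), \alpha \rangle + f(w(\hat{\alpha})).$$

Proof: Direct calculation using (15a) and the convexity of $g$ yields

$$\begin{aligned}
D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha - \hat{\alpha} \rangle
&= -g(\hat{\alpha}) + \langle A w(\hat{\alpha}), \hat{\alpha} \rangle + f(w(\hat{\alpha})) + \langle -\nabla g(\hat{\alpha}) + A w(\hat{\alpha}), \alpha - \hat{\alpha} \rangle \\
&\ge -g(\alpha) + \langle A w(\hat{\alpha}), \alpha \rangle + f(w(\hat{\alpha})). \qquad \blacksquare
\end{aligned}$$

Furthermore, because $d$ is $\sigma$-strongly convex, it follows that

$$\Delta(\alpha, \bar{\alpha}) := d(\alpha) - d(\bar{\alpha}) - \langle \nabla d(\bar{\alpha}), \alpha - \bar{\alpha} \rangle \ge \frac{\sigma}{2} \|\alpha - \bar{\alpha}\|^2. \tag{37}$$

As $\alpha_0$ minimizes $d$ over $Q_2$, we have

$$\langle \nabla d(\alpha_0), \alpha - \alpha_0 \rangle \ge 0 \quad \forall\, \alpha \in Q_2. \tag{38}$$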
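We first show that the initial $w_1$ and $\alpha_1$ satisfy the excessive gap condition (11). Since $-D$ is $L$-l.c.g., we have

$$\begin{aligned}
D(\alpha_1) &\ge D(\alpha_0) + \langle \nabla D(\alpha_0), \alpha_1 - \alpha_0 \rangle - \frac{L}{2}\|\alpha_1 - \alpha_0\|^2 \\
&\ge D(\alpha_0) + \langle \nabla D(\alpha_0), \alpha_1 - \alpha_0 \rangle - \mu_1 \Delta(\alpha_1, \alpha_0) && \text{(using defn. of } \mu_1 \text{ and (37))} \\
&= D(\alpha_0) - \mu_1 \min_{\alpha \in Q_2} \left\{ -\mu_1^{-1} \langle \nabla D(\alpha_0), \alpha - \alpha_0 \rangle + \Delta(\alpha, \alpha_0) \right\} && \text{(using defn. of } \alpha_1\text{)} \\
&\ge D(\alpha_0) - \mu_1 \min_{\alpha \in Q_2} \left\{ -\mu_1^{-1} \langle \nabla D(\alpha_0), \alpha - \alpha_0 \rangle + d(\alpha) \right\} && \text{(using (38) and } d(\alpha_0) = 0\text{)} \\
&= \max_{\alpha \in Q_2} \left\{ D(\alpha_0) + \langle \nabla D(\alpha_0), \alpha - \alpha_0 \rangle - \mu_1 d(\alpha) \right\} \\
&\ge \max_{\alpha \in Q_2} \left\{ -g(\alpha) + \langle A w(\alpha_0), \alpha \rangle + f(w(\alpha_0)) - \mu_1 d(\alpha) \right\} && \text{(using Lemma 9)} \\
&= J_{\mu_1}(w_1),
\end{aligned}$$

which shows that our initialization indeed satisfies (11). Second, we prove by induction that the updates in
Algorithm 1 maintain (11). We begin with two useful observations. Using (16) and the definition of $\tau_k$, one
can bound

$$\mu_{k+1} = \frac{6}{(k+3)(k+2)} \frac{L}{\sigma} \ge \tau_k^2 \frac{L}{\sigma}. \tag{39}$$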
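
For the reader's convenience, here is a sketch of why a bound of this form holds. It assumes the standard parameter choices of the excessive gap technique, namely $\mu_1 = L/\sigma$, $\mu_{k+1} = (1-\tau_k)\mu_k$ and $\tau_k = 2/(k+3)$; these are our reading of (16) and of the definition of $\tau_k$, and should be checked against them.

$$\begin{aligned}
\mu_{k+1} &= \mu_1 \prod_{i=1}^{k} (1 - \tau_i) = \frac{L}{\sigma} \prod_{i=1}^{k} \frac{i+1}{i+3} = \frac{L}{\sigma} \cdot \frac{2 \cdot 3}{(k+2)(k+3)}, \\
\frac{6}{(k+2)(k+3)} \ge \frac{4}{(k+3)^2} \;&\Longleftrightarrow\; 6(k+3) \ge 4(k+2) \;\Longleftrightarrow\; 2k + 10 \ge 0,
\end{aligned}$$

and the last inequality holds for all $k \ge 1$, so $\mu_{k+1} \ge \tau_k^2 L / \sigma$ under these assumptions.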
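Let $\beta := \alpha_{\mu_k}(w_k)$. The optimality conditions for (15b) imply

$$\langle \mu_k \nabla d(\beta) - A w_k + \nabla g(\beta), \alpha - \beta \rangle \ge 0. \tag{40}$$

By using the update equation for $w_{k+1}$ and the convexity of $f$, we have

$$\begin{aligned}
J_{\mu_{k+1}}(w_{k+1}) &= f(w_{k+1}) + \max_{\alpha \in Q_2} \left\{ \langle A w_{k+1}, \alpha \rangle - g(\alpha) - \mu_{k+1} d(\alpha) \right\} \\
&= f\big((1-\tau_k) w_k + \tau_k w(\hat{\alpha})\big) \\
&\quad + \max_{\alpha \in Q_2} \left\{ (1-\tau_k) \langle A w_k, \alpha \rangle + \tau_k \langle A w(\hat{\alpha}), \alpha \rangle - g(\alpha) - (1-\tau_k) \mu_k d(\alpha) \right\} \\
&\le \max_{\alpha \in Q_2} \left\{ (1-\tau_k) T_1 + \tau_k T_2 \right\},
\end{aligned}$$

where $T_1 = \left[ -\mu_k d(\alpha) + \langle A w_k, \alpha \rangle - g(\alpha) + f(w_k) \right]$ and $T_2 = \left[ -g(\alpha) + \langle A w(\hat{\alpha}), \alpha \rangle + f(w(\hat{\alpha})) \right]$.
$T_1$ can be bounded as follows:

$$\begin{aligned}
T_1 &= -\mu_k d(\alpha) + \langle A w_k, \alpha \rangle - g(\alpha) + f(w_k) \\
&= -\mu_k \left[ \Delta(\alpha, \beta) + d(\beta) + \langle \nabla d(\beta), \alpha - \beta \rangle \right] + \langle A w_k, \alpha \rangle - g(\alpha) + f(w_k) && \text{(using defn. of } \Delta\text{)} \\
&\le -\mu_k \Delta(\alpha, \beta) - \mu_k d(\beta) + \langle -A w_k + \nabla g(\beta), \alpha - \beta \rangle + \langle A w_k, \alpha \rangle - g(\alpha) + f(w_k) && \text{(using (40))} \\
&= -\mu_k \Delta(\alpha, \beta) - \mu_k d(\beta) + \langle A w_k, \beta \rangle - g(\alpha) + \langle \nabla g(\beta), \alpha - \beta \rangle + f(w_k) \\
&\le -\mu_k \Delta(\alpha, \beta) - \mu_k d(\beta) + \langle A w_k, \beta \rangle - g(\beta) + f(w_k) && \text{(using convexity of } g\text{)} \\
&= -\mu_k \Delta(\alpha, \beta) + J_{\mu_k}(w_k) && \text{(using defn. of } \beta\text{)} \\
&\le -\mu_k \Delta(\alpha, \beta) + D(\alpha_k) && \text{(using induction assumption)} \\
&\le -\mu_k \Delta(\alpha, \beta) + D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha_k - \hat{\alpha} \rangle, && \text{(using concavity of } D\text{)}
\end{aligned}$$

while $T_2$ can be bounded by using Lemma 9:

$$T_2 = -g(\alpha) + \langle A w(\hat{\alpha}), \alpha \rangle + f(w(\hat{\alpha})) \le D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha - \hat{\alpha} \rangle.$$

Putting the upper bounds on $T_1$ and $T_2$ together, we obtain the desired result:

$$\begin{aligned}
J_{\mu_{k+1}}(w_{k+1}) &\le \max_{\alpha \in Q_2} \left\{ (1-\tau_k) \left[ -\mu_k \Delta(\alpha, \beta) + D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha_k - \hat{\alpha} \rangle \right] + \tau_k \left[ D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha - \hat{\alpha} \rangle \right] \right\} \\
&= \max_{\alpha \in Q_2} \left\{ -\mu_{k+1} \Delta(\alpha, \beta) + D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), (1-\tau_k) \alpha_k + \tau_k \alpha - \hat{\alpha} \rangle \right\} \\
&= \max_{\alpha \in Q_2} \left\{ -\mu_{k+1} \Delta(\alpha, \beta) + D(\hat{\alpha}) + \tau_k \langle \nabla D(\hat{\alpha}), \alpha - \beta \rangle \right\} && \text{(using defn. of } \hat{\alpha}\text{)} \\
&= -\min_{\alpha \in Q_2} \left\{ \mu_{k+1} \Delta(\alpha, \beta) - D(\hat{\alpha}) - \tau_k \langle \nabla D(\hat{\alpha}), \alpha - \beta \rangle \right\} \\
&= -\mu_{k+1} \Delta(\tilde{\alpha}, \beta) + D(\hat{\alpha}) + \tau_k \langle \nabla D(\hat{\alpha}), \tilde{\alpha} - \beta \rangle && \text{(using defn. of } \tilde{\alpha}\text{)} \\
&\le -\tfrac{1}{2} \mu_{k+1} \sigma \|\tilde{\alpha} - \beta\|^2 + D(\hat{\alpha}) + \tau_k \langle \nabla D(\hat{\alpha}), \tilde{\alpha} - \beta \rangle && \text{(using (37))} \\
&\le -\tfrac{1}{2} \tau_k^2 L \|\tilde{\alpha} - \beta\|^2 + D(\hat{\alpha}) + \tau_k \langle \nabla D(\hat{\alpha}), \tilde{\alpha} - \beta \rangle && \text{(using (39))} \\
&= -\tfrac{1}{2} L \|\alpha_{k+1} - \hat{\alpha}\|^2 + D(\hat{\alpha}) + \langle \nabla D(\hat{\alpha}), \alpha_{k+1} - \hat{\alpha} \rangle && \text{(using defn. of } \alpha_{k+1}\text{)} \\
&\le D(\alpha_{k+1}). && \text{(by } L\text{-l.c.g. of } -D\text{)}
\end{aligned}$$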

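B Proof of Corollary 7

$$\begin{aligned}
D(\alpha_{k+1}) \ge J_{\mu_{k+1}}(w_{k+1}) &= f(w_{k+1}) + \max_{\alpha} \left\{ \langle A w_{k+1}, \alpha \rangle - g(\alpha) - \mu_{k+1} d(\alpha) \right\} \\
&\ge f(w_{k+1}) + \langle A w_{k+1}, \alpha^* \rangle - g(\alpha^*) - \mu_{k+1} d(\alpha^*) \\
&\ge -g(\alpha^*) + \min_{w} \left\{ f(w) + \langle A w, \alpha^* \rangle \right\} - \mu_{k+1} d(\alpha^*) \\
&= D(\alpha^*) - \mu_{k+1} d(\alpha^*).
\end{aligned}$$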
C Primal and Dual Objective Evaluation using Clique Decomposition

We show how to efficiently compute the primal and dual objective function values. The primal objective
value is easy to evaluate because $\|w_k\|^2$ and the inner products between $w_k$ and feature vectors are convenient to compute.
Afterwards any MAP algorithm can be used to find the $\max_{y \in \mathcal{Y}}$. The dual objective is also easy since

$$\sum_i \sum_y \ell^i_y (\alpha_k)^i_y = \sum_i \sum_y \sum_c \ell^i_{y_c} (\alpha_k)^i_y = \sum_i \sum_c \sum_{y_c} \ell^i_{y_c} \sum_{y:\, y|_c = y_c} (\alpha_k)^i_y = \sum_{i,c,y_c} \ell^i_{y_c} (\alpha_k)^i_{y_c}$$

and the marginals of $\alpha_k$ are available. Finally, the quadratic term in $D(\alpha_k)$ can be computed as follows.

$$\|A^\top \alpha_k\|^2 = \sum_c \sum_{i,j,y_c,y'_c} (\alpha_k)^i_{y_c} (\alpha_k)^j_{y'_c}\, k_c\big((x^i, y_c), (x^j, y'_c)\big),$$

where the inner term is the same as the unnormalized expectation that can be efficiently calculated. The last
formula is only for nonlinear kernels.
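
The identity behind the loss term can be checked on a small example. The sketch below is a hypothetical setup: a single training example with three binary output variables, single-node cliques, and a Hamming loss that decomposes over them. It compares the direct sum over all labelings with the sum over clique marginals. The marginals are obtained by brute force here only to verify the identity; in practice they would come from inference on the graphical model.

```python
import itertools
import random

random.seed(0)
T, K = 3, 2                      # number of output variables, labels per variable
y_true = (0, 1, 1)               # hypothetical ground-truth labeling
labelings = list(itertools.product(range(K), repeat=T))

# An arbitrary joint distribution alpha over all K**T labelings.
raw = [random.random() for _ in labelings]
total = sum(raw)
alpha = {y: r / total for y, r in zip(labelings, raw)}


def hamming(y):
    """Hamming loss, which decomposes as a sum over single-node cliques."""
    return sum(int(y[t] != y_true[t]) for t in range(T))


# Direct evaluation: a sum over exponentially many labelings.
direct = sum(hamming(y) * alpha[y] for y in labelings)

# Clique-wise evaluation: only the single-node marginals are needed.
marginals = [[sum(a for y, a in alpha.items() if y[t] == v) for v in range(K)]
             for t in range(T)]
via_marginals = sum(marginals[t][v] * int(v != y_true[t])
                    for t in range(T) for v in range(K))
```

The clique-wise sum has `T * K` terms instead of `K ** T`, which is the saving the decomposition provides.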

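D Kernelizing the Excessive Gap Method for M³Ns

Compared with the linear kernel case, the only difficulty caused by nonlinear kernels is that $w_k$ cannot be
expressed explicitly. However, if $w_k$ can be expressed as the expectation of the feature vector with respect to
some distribution $\beta_k \in \mathcal{S}^n$, then we only need to update $w_k$ implicitly via $\beta_k$, and the inner product between
$w_k$ and any feature vector can also be efficiently calculated. We formalize and prove this claim by induction.

Theorem 10 For all $k > 0$, there exists $\beta_k \in \mathcal{S}^n$ such that $(w_k)_c = \frac{1}{\lambda} F[\psi_c; \beta_k]$, and $\beta_k$ can be updated by

$$\beta_{k+1} = (1 - \tau_k) \beta_k + \tau_k \hat{\alpha}_k.$$

Proof: First, $(w_1)_c = (w(\alpha_0))_c = \frac{1}{\lambda} F[\psi_c; \alpha_0]$ for every clique $c \in \mathcal{C}$, so $\beta_1 = \alpha_0$. Suppose the claim holds for all $1, \ldots, k$. Then

$$\begin{aligned}
(w_{k+1})_c &= (1 - \tau_k)(w_k)_c + \frac{\tau_k}{\lambda} F[\psi_c; (\hat{\alpha}_k)_c] = (1 - \tau_k) \frac{1}{\lambda} F[\psi_c; \beta_k] + \frac{\tau_k}{\lambda} F[\psi_c; (\hat{\alpha}_k)_c] \\
&= \frac{1}{\lambda} F[\psi_c; (1 - \tau_k)(\beta_k)_c + \tau_k (\hat{\alpha}_k)_c].
\end{aligned}$$

Therefore, we can set $\beta_{k+1} = (1 - \tau_k)\beta_k + \tau_k \hat{\alpha}_k \in \mathcal{S}^n$. ■

In general $\tilde{\alpha}_k \ne \hat{\alpha}_k$, hence $\beta_k \ne \alpha_k$. To compute $\langle \psi^i_{y_c}, (w_k)_c \rangle$ required by (35), we expand $(w_k)_c$ as the
expectation under $\beta_k$, so that the inner product becomes a sum of kernel evaluations weighted by the clique marginals of $\beta_k$.
By using this trick, all the iterative updates in Algorithm 2 can be done efficiently, and so can the evaluation
of $\|w_k\|^2$ and the primal and dual objectives. We leave the details to the reader.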
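
The following minimal sketch illustrates the implicit representation on a toy problem. Everything in it is hypothetical: the weight vector is taken to be $w = \frac{1}{\lambda} \sum_j \beta_j \Phi(p_j)$ for a handful of points $p_j$, and a degree-2 polynomial kernel is chosen because its feature map $\Phi$ can be written out explicitly, which allows the kernel-only computation to be checked against the explicit one. The $1/\lambda$ scaling and the flat (non-clique) indexing are simplifying assumptions.

```python
def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def poly2_features(x):
    """Explicit feature map of poly2_kernel: all pairwise products x_a * x_b."""
    return [a * b for a in x for b in x]


points = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]]   # hypothetical inputs
beta = [0.2, 0.5, 0.3]                            # weights representing w implicitly
lam = 0.1
query = [0.7, -0.4]

# Explicit route: build w in feature space, then take the inner product.
dim = len(poly2_features(points[0]))
w = [sum(b * poly2_features(p)[d] for b, p in zip(beta, points)) / lam
     for d in range(dim)]
explicit = sum(wd * fd for wd, fd in zip(w, poly2_features(query)))

# Implicit route: only kernel evaluations weighted by beta are needed.
implicit = sum(b * poly2_kernel(query, p) for b, p in zip(beta, points)) / lam

# Updating w then amounts to a convex combination of the weights.
tau = 0.5
alpha_hat = [0.6, 0.1, 0.3]                       # hypothetical new distribution
beta_next = [(1 - tau) * b + tau * a for b, a in zip(beta, alpha_hat)]
```

For kernels without a finite-dimensional feature map only the implicit route is available, which is the situation the theorem addresses.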



Supplementary Material 



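E Proof of Lemma 8

Proof: Using (20) and (22) we can write

$$\begin{aligned}
(g + \mu d)^*(u) &= \sup_{\alpha \in \mathcal{S}^n} \left\{ \langle u, \alpha \rangle - g(\alpha) - \mu d(\alpha) \right\} \\
&= \sup_{\alpha \in \mathcal{S}^n} \sum_i \sum_y u^i_y \alpha^i_y + \sum_i \sum_y \ell^i_y \alpha^i_y - \mu \sum_i \sum_y \alpha^i_y \log \alpha^i_y - \mu \log n - \mu \log |\mathcal{Y}| \\
&= \sup_{\alpha \in \mathcal{S}^n} \sum_i \sum_y \left( u^i_y + \ell^i_y - \mu \log \alpha^i_y \right) \alpha^i_y - \mu \log n - \mu \log |\mathcal{Y}|.
\end{aligned}$$

By introducing non-negative Lagrange multipliers $\sigma_i$ we can write the partial Lagrangian of the above maximization
problem:

$$L(\alpha, \sigma) = \sum_i \sum_y \left( u^i_y + \ell^i_y - \mu \log \alpha^i_y \right) \alpha^i_y - \mu \log n - \mu \log |\mathcal{Y}| - \sum_i \sigma_i \left( \sum_y \alpha^i_y - \frac{1}{n} \right).$$

Taking the partial derivative with respect to $\alpha^i_y$ and setting it to 0, we get

$$u^i_y + \ell^i_y - \mu \log \alpha^i_y - \mu - \sigma_i = 0.$$

Therefore

$$\alpha^i_y = \frac{1}{n Z_i} \exp\left( \frac{u^i_y + \ell^i_y}{\mu} \right), \quad \text{where } Z_i := \sum_y \exp\left( \frac{u^i_y + \ell^i_y}{\mu} \right).$$

Plugging this back into the Lagrangian, we can eliminate both $\alpha$ and $\sigma_i$ and write out the solution of the
optimization problem in closed form:

$$\sum_i \sum_y \left( \mu \log Z_i + \mu \log n \right) \alpha^i_y - \mu \log n - \mu \log |\mathcal{Y}| = \frac{\mu}{n} \sum_i \log \sum_y \exp\left( \frac{u^i_y + \ell^i_y}{\mu} \right) - \mu \log |\mathcal{Y}|.$$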
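
As a sanity check on this closed form, the sketch below evaluates the objective $\sum_i \sum_y (u^i_y + \ell^i_y - \mu \log \alpha^i_y)\alpha^i_y - \mu \log n - \mu \log |\mathcal{Y}|$ at the softmax solution $\alpha^i_y = \exp((u^i_y + \ell^i_y)/\mu) / (n Z_i)$ and compares it with the log-sum-exp expression. The problem sizes and the random values of $u$ and $\ell$ are hypothetical, and the feasible set is assumed to be $\{\alpha \ge 0 : \sum_y \alpha^i_y = 1/n \text{ for all } i\}$. It also confirms that randomly drawn feasible points do not exceed the closed-form value.

```python
import math
import random

random.seed(1)
n, num_labels, mu = 3, 4, 0.7
u = [[random.uniform(-1, 1) for _ in range(num_labels)] for _ in range(n)]
ell = [[random.uniform(0, 1) for _ in range(num_labels)] for _ in range(n)]


def objective(alpha):
    """Objective inside the sup, for strictly positive alpha."""
    total = 0.0
    for i in range(n):
        for y in range(num_labels):
            a = alpha[i][y]
            total += (u[i][y] + ell[i][y] - mu * math.log(a)) * a
    return total - mu * math.log(n) - mu * math.log(num_labels)


# Closed-form maximizer and optimal value.
Z = [sum(math.exp((u[i][y] + ell[i][y]) / mu) for y in range(num_labels))
     for i in range(n)]
alpha_star = [[math.exp((u[i][y] + ell[i][y]) / mu) / (n * Z[i])
               for y in range(num_labels)] for i in range(n)]
closed_form = mu / n * sum(math.log(z) for z in Z) - mu * math.log(num_labels)


def random_feasible():
    """A random strictly positive point whose rows each sum to 1/n."""
    rows = []
    for _ in range(n):
        r = [random.random() + 1e-3 for _ in range(num_labels)]
        s = sum(r)
        rows.append([x / (n * s) for x in r])
    return rows


best_random = max(objective(random_feasible()) for _ in range(200))
```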



